This assignment assesses your understanding of basic statistics, probability, Bayes’ Theorem and the linear regression model, covered in Modules 1 and 2. The assessment is marked out of 50 and contributes 20% to your final score.
You can complete your assignment using the code shared in the unit (i.e. Alexandria, videos, practical activities on Moodle) and this template as the basis. However, you should make sure the code you are using is correct and relevant to the question.
Please follow the structure of this template as much as you can.
You can use the prepopulated code cells or change them if you prefer. However, please do not change the names of the key variables, functions, and parameters, e.g. wineRed, wineWht. This helps us to read and understand your submission more efficiently.
All your answers need to be put into this file. You can write equations in R Markdown; please refer to https://rmd4sci.njtierney.com/math.html#example-math-commands for more information and examples.
Suppose we have 9 balls in a box, 4 white balls and 5 red balls. All the balls have the same size and same weight; only the color is different. Each time, we take one ball out of the box without looking at its color, so each ball has an equal chance of being picked.
Using this and properties of random variables, please answer the following questions.
Assuming there is no replacement, let R = 1 red ball (eg. RR = 2 red balls), W = 1 white ball (eg. WWW = 3 white balls).
There are a total of 9 balls, 5 of which are red, so the probability that the first ball is red is \(\frac{5}{9}\). If we pick 1 ball from the 9 and it turns out to be red, then 8 balls remain in the box, 4 of which are red.
If we pick another ball from the box, we are choosing from 8 balls, 4 of which are red, so the probability that the second ball is also red is \(\frac{4}{8}\).
Therefore the probability of choosing 2 red balls is:
\[Pr(RR) = \frac{5}{9}*\frac{4}{8}=\frac{5}{18}\]
The probability of choosing at least one white ball is the complement of the probability of choosing no white balls. This follows from the probability axiom that total probability = 1.
\(Pr(\text{at least one W}) = 1 - Pr(\text{no white balls})\)
Since we are drawing 3 balls, if none of them are white then all 3 must be red, chosen from the 5 red balls among the 9 balls in total.
\(Pr(\text{no white balls}) = Pr(RRR)\)
Therefore the probability of drawing 3 red balls in a row from the 9 balls (5 of which are red) is:
\[Pr(RRR) = \frac{5}{9}*\frac{4}{8}*\frac{3}{7} = \frac{5}{42}\]
Therefore, the probability of selecting at least 1 white ball is: \[Pr(\text{at least one W}) = 1 - Pr(RRR) = 1 - \frac{5}{42} = \frac{37}{42}\]
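As a sanity check, the same answer can be obtained from the hypergeometric distribution, which describes draws without replacement. This is only a sketch; the variable names below are my own and not part of the template.

```r
# dhyper(x, m, n, k): probability of drawing x white balls when k balls are
# taken without replacement from a box with m white and n non-white (red) balls
p_no_white <- dhyper(0, m = 4, n = 5, k = 3)

# Complement rule: at least one white = 1 - no white
p_at_least_one_white <- 1 - p_no_white

print(c(no_white = p_no_white, at_least_one_white = p_at_least_one_white))
```

The first value should agree with \(\frac{5}{42}\) and the second with \(\frac{37}{42}\).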
Let 3W2R = 3 white balls, 2 red balls.
Order is not important and balls are not replaced.
Since we are choosing 5 balls from a total of 9 balls, there are \(\binom95\) possible combinations.
Total number of ways to choose 5 balls: \[N(\text{choose 5}) = \binom95 = \frac{9!}{5!4!} = 126\]
Number of ways to choose 3 white balls out of a total of 4 white balls: \[N(3W) = \binom43 = \frac{4!}{3!1!} = 4\]
Number of ways to choose 2 red balls out of a total of 5 red balls: \[N(2R) = \binom52 = \frac{5!}{2!3!} = 10\]
Therefore: \[Pr(3W2R) = \frac{N(3W)N(2R)}{N(\text{choose 5})} = \frac{4*10}{126} = \frac{20}{63}\]
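The counting argument can be reproduced with `choose()`. This is a small illustrative sketch and the variable names are my own.

```r
# Number of ways to pick any 5 of the 9 balls
n_total <- choose(9, 5)

# Number of ways to pick 3 of the 4 white balls and 2 of the 5 red balls
n_white <- choose(4, 3)
n_red <- choose(5, 2)

# Favourable combinations over all combinations
p_3w2r <- n_white * n_red / n_total
print(p_3w2r)
```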
Let W = 1 point
Let R = 2 points
To score:
3 points we need {WWW}
4 points we need any one of {WWR, WRW, RWW}
\[Pr(\text{3 points}) = Pr(WWW) = \frac{4}{9}*\frac{3}{8}*\frac{2}{7}=\frac{1}{21}\]
\[Pr(\text{4 points}) = Pr(WWR) + Pr(WRW) + Pr(RWW) = (\frac{4}{9}*\frac{3}{8}*\frac{5}{7}) + (\frac{4}{9}*\frac{5}{8}*\frac{3}{7}) + (\frac{5}{9}*\frac{4}{8}*\frac{3}{7}) = \frac{5}{14}\]
The probability of getting greater than 4 points is 1 - (probability of getting 3 points or 4 points) \[Pr(>4 points) = 1 - (Pr(3 points) + Pr(4 points))\] \[Pr(>4 points) = 1-(\frac{1}{21}+\frac{5}{14})=1-\frac{17}{42} = \frac{25}{42}\]
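To double check the arithmetic, the full score distribution for 3 draws can be tabulated in R. This sketch assumes the scoring above (W = 1 point, R = 2 points); the names `whites`, `score` and `prob` are my own.

```r
# Possible numbers of white balls among the 3 balls drawn
whites <- 0:3

# Each white ball scores 1 point, each red ball scores 2 points
score <- whites * 1 + (3 - whites) * 2

# Hypergeometric probability of each white count (4 white, 5 red, 3 drawn)
prob <- dhyper(whites, m = 4, n = 5, k = 3)

score_table <- data.frame(whites, score, prob)
print(score_table)

# Probability of scoring more than 4 points
p_gt4 <- sum(prob[score > 4])
print(p_gt4)
```

Summing the rows with a score above 4 avoids the complement step and should give the same \(\frac{25}{42}\).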
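Since a white ball scores 1 point and a red ball scores 2 points, the total is odd only when the number of white balls is odd. To obtain an odd number of points, we therefore need either 1 white ball (eg. WRRRR) or 3 white balls (eg. WWWRR) among the 5 balls drawn. No larger odd count is possible, since there are only 4 white balls in the box.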
If A is the event that we have an odd number of points, then: \[Pr(A) = Pr(1W) + Pr(3W)\]
There are \(\binom41\) ways to choose 1 white ball, and \(\binom54\) ways to choose 4 red balls, so \[Pr(1W) = \frac{\binom41 \binom54}{\binom95} =\frac{10}{63}\]
There are \(\binom43\) ways to choose 3 white balls, and \(\binom52\) ways to choose 2 red balls, so \[Pr(3W) = \frac{\binom43 \binom52}{\binom95} =\frac{20}{63}\]
Therefore: \[Pr(A) = Pr(1W) + Pr(3W) = \frac{10}{63}+\frac{20}{63} = \frac{10}{21}\]
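The parity argument can also be checked by enumerating every possible white count for 5 draws. This is a sketch that assumes the same scoring as before (W = 1 point, R = 2 points); the variable names are my own.

```r
# Possible numbers of white balls among the 5 balls drawn (at most 4 exist)
whites <- 0:4

# Total points for each white count
points <- whites * 1 + (5 - whites) * 2

# Hypergeometric probability of each white count (4 white, 5 red, 5 drawn)
prob <- dhyper(whites, m = 4, n = 5, k = 5)

# Keep only the outcomes with an odd total
p_odd <- sum(prob[points %% 2 == 1])
print(p_odd)
```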
You are asked to use conditional probability and Bayes Theorem introduced in Module 1 to solve the questions below.
Suppose there are 3 production lines in total (LA, LB and LC) in a factory. LA, LB and LC account for 20%, 30% and 50% of the factory output, respectively. The fraction of defective items produced is 3% for LA, 4% for LB and 5% for LC.
Pr(LA) = 0.2
Pr(LB) = 0.3
Pr(LC) = 0.5
Let D be the event of a defective item selected.
Therefore:
Pr(D|LA) = 0.03
Pr(D|LB) = 0.04
Pr(D|LC) = 0.05
By the law of total probability, the probability of getting a defective item is the sum, over all production lines, of the probability that an item comes from that line and is defective. \[Pr(D) = Pr(D|LA)Pr(LA) + Pr(D|LB)Pr(LB) + Pr(D|LC)Pr(LC)\]
\[Pr(D) = (0.03)(0.2) + (0.04)(0.3) + (0.05)(0.5) = 0.043 = 4.3\%\]
Therefore the probability of a good quality item selected is (by the probability axiom that total probability = 1): \[Pr(D^{c})=1-Pr(D)\] \[= 1-0.043 = 0.957 = 95.7\%\]
Probability of a good item = \(95.7\%\)
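The same calculation can be written as a short R sketch using named vectors. The names `prior` and `defect_rate` are my own and are not part of the template.

```r
# Share of factory output per production line: Pr(L)
prior <- c(LA = 0.2, LB = 0.3, LC = 0.5)

# Fraction of defective items per production line: Pr(D | L)
defect_rate <- c(LA = 0.03, LB = 0.04, LC = 0.05)

# Law of total probability: Pr(D) = sum of Pr(D | L) * Pr(L)
p_defect <- sum(defect_rate * prior)

# Complement: probability of a good quality item
p_good <- 1 - p_defect

print(c(defective = p_defect, good = p_good))
```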
To find which production line most likely produced the defective item, we need the probability that the item came from each production line, given that it is defective.
Since we already know the probability of a defect for each production line Pr(defect|productionLine), we can apply Bayes’ Theorem to calculate Pr(productionLine|defect).
Bayes' Theorem: \[Pr(B|A) = \frac{Pr(A|B)Pr(B)}{Pr(A)}\]
Pr(Defect came from Line A) = Pr(LA|D) = \(\frac{Pr(D|LA)Pr(LA)}{Pr(D)}\) = \(\frac{(0.03)(0.2)}{(0.043)}*100\) = 13.95%
Pr(Defect came from Line B) = Pr(LB|D) = \(\frac{Pr(D|LB)Pr(LB)}{Pr(D)}\) = \(\frac{(0.04)(0.3)}{(0.043)}*100\) = 27.91%
Pr(Defect came from Line C) = Pr(LC|D) = \(\frac{Pr(D|LC)Pr(LC)}{Pr(D)}\) = \(\frac{(0.05)(0.5)}{(0.043)}*100\) = 58.14%
We can see that the defect most likely came from production line C, as it has the highest posterior probability, Pr(LC|D) = 58.14%.
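The three posterior probabilities can be computed in one step in R. This is a minimal sketch and the variable names are my own.

```r
# Pr(L) and Pr(D | L) for each production line
prior <- c(LA = 0.2, LB = 0.3, LC = 0.5)
defect_rate <- c(LA = 0.03, LB = 0.04, LC = 0.05)

# Bayes' Theorem: Pr(L | D) = Pr(D | L) * Pr(L) / Pr(D)
joint <- defect_rate * prior
posterior <- joint / sum(joint)

# Posterior probabilities as percentages, and the most likely line
print(round(100 * posterior, 2))
most_likely <- names(which.max(posterior))
print(most_likely)
```

Because the posteriors share the same denominator Pr(D), the ranking of the lines is already determined by the numerators Pr(D|L)Pr(L).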
The Wine Quality data set contains two files, one for red wine and the other for white wine. Details of the two data sets can be found at https://archive.ics.uci.edu/ml/datasets/Wine+Quality. Please provide a correlation analysis of the two data sets and outline your findings. Please note that result visualisation is needed.
# You are free to install any required library for visualisation
# If you get an error that any of the libraries below is missing, please install it via install.packages(...) or Tools -> Install Packages in RStudio
library(ISLR)
library(MASS)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(tidyr)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(ggplot2)
library(gtable)
library(grid)
library(directlabels)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
winRed = read.csv('winequality-red.csv')
winWht = read.csv('winequality-white.csv')
head(winRed)
head(winWht)
colnames(winRed)
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
head(winRed)
colnames(winWht)
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
head(winWht)
Looking at the heads and column names of both datasets, I can see that they have the same column names. I am particularly interested in the column ‘quality’. From investigating the data source further, I have discovered that all of the other variables contribute to the final ‘quality’ score of the wine. The quality variable is therefore a score given to the wine based on the other variables.
First I will investigate the structure of the datasets:
str(winRed)
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
str(winWht)
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
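From the above, we can see that:
* For winRed: there are 1599 observations and 12 variables. All of the variables are stored as numeric (num), except quality, which is stored as an integer.
* For winWht: there are 4898 observations and 12 variables, with the same variable types as winRed.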
For data integrity I will also check for missing data:
sum(is.na(winRed))
## [1] 0
sum(is.na(winWht))
## [1] 0
There is no missing data! :)
Now, to explore each of the variables, I will find the following for each one:
* Summary
* Peak**
* Standard Deviation
* IQR
The above will tell me about the spread of the data and its concentration, and help identify the existence of outliers.
** I am interested in finding the peak (the location of the highest point of the estimated density) of each variable. By comparing the peak to the median and mean, I can envision the position of the peak of each graph and further determine the shape of the distribution.
I have written a function getPeak() below which finds the value in the distribution that is the highest on the y-axis, and then finds the corresponding x value. I know that some data will have multiple peaks but I am initially just looking for the highest peak to get an estimation of what the distribution may look like.
getPeak <- function(v) {
  # Estimate the kernel density once and reuse it
  d <- density(v)
  # Return the x value at which the estimated density is highest
  d$x[which.max(d$y)]
}
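To illustrate how the peak relates to the median and mean, here is a sketch on simulated right-skewed data. The data are hypothetical (drawn with `rlnorm()`), not taken from the wine files, and `getPeak()` is repeated so that the block runs on its own.

```r
# Same idea as getPeak() above: x value where the kernel density is highest
getPeak <- function(v) {
  d <- density(v)
  d$x[which.max(d$y)]
}

# Hypothetical right-skewed sample (log-normal), used only for illustration
set.seed(1)
x <- rlnorm(5000, meanlog = 0, sdlog = 0.5)

peak_x <- getPeak(x)
print(c(peak = peak_x, median = median(x), mean = mean(x)))
```

For a right-skewed sample like this one, I would expect the ordering peak < median < mean, which is the pattern I look for in the summaries below. The exact peak location depends on the default bandwidth chosen by `density()`.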
From the summaries I have observed the following about winRed:
summary(winRed$fixed.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
paste("Peak is at", round(getPeak(winRed$fixed.acidity),4))
## [1] "Peak is at 7.3362"
paste("Standard deviation (SD) is ", round(sd(winRed$fixed.acidity),4))
## [1] "Standard deviation (SD) is 1.7411"
paste("IQR is ", round(IQR(winRed$fixed.acidity),4))
## [1] "IQR is 2.1"
The median (7.90) is below the mean (8.32), and the peak (7.34) is close to the 1stQ value (7.10), indicating a right skew. This is further supported by the large distance from the 3rdQ value (9.20) to the max value (15.90). The small IQR suggests the peak will be quite narrow, and the small standard deviation means most of the values will be close to the mean.
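One way to put a number on this kind of quartile comparison is the quartile (Bowley) skewness coefficient. This measure is not used elsewhere in this report; I am adding it here only as an illustrative sketch, using the quartiles from the fixed acidity summary above. The function name `bowley_skew` is my own.

```r
# Quartile (Bowley) skewness: positive values indicate a right skew
bowley_skew <- function(v) {
  q <- quantile(v, c(0.25, 0.5, 0.75), names = FALSE)
  ((q[3] - q[2]) - (q[2] - q[1])) / (q[3] - q[1])
}

# Worked example with the winRed fixed acidity quartiles from summary()
q1 <- 7.10
med <- 7.90
q3 <- 9.20
bowley_fixed_acidity <- ((q3 - med) - (med - q1)) / (q3 - q1)
print(bowley_fixed_acidity)
```

With the data loaded, `bowley_skew(winRed$fixed.acidity)` should give a similar positive value, since the upper half of the IQR (1.3) is wider than the lower half (0.8).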
winRed Volatile Acidity: The amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
summary(winRed$volatile.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
paste("Peak is at", round(getPeak(winRed$volatile.acidity),4))
## [1] "Peak is at 0.5917"
paste("Standard deviation (SD) is ", round(sd(winRed$volatile.acidity),4))
## [1] "Standard deviation (SD) is 0.1791"
paste("IQR is ", round(IQR(winRed$volatile.acidity),4))
## [1] "IQR is 0.25"
The median is very close to the mean, however the peak is greater than both and closer to the 3rdQ value. The IQR and SD are both quite small, indicating that the majority of the data points will be clustered around the mean. This would indicate a wide peak.
winRed Citric Acid: Found in small quantities, citric acid can add ‘freshness’ and flavor to wines
summary(winRed$citric.acid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
paste("Peak is at", round(getPeak(winRed$citric.acid),4))
## [1] "Peak is at 0.0302"
paste("Standard deviation (SD) is ", round(sd(winRed$citric.acid),4))
## [1] "Standard deviation (SD) is 0.1948"
paste("IQR is ", round(IQR(winRed$citric.acid),4))
## [1] "IQR is 0.33"
The median is very close to the mean, however the peak and min value are also extremely close, indicating that this could be a heavily right-skewed distribution. The 1stQ and 3rdQ values are quite far apart, as are the 3rdQ and max values, again supporting a right skew. The standard deviation is quite low, meaning most values are quite concentrated around the mean, and the IQR is quite small, so there is going to be a wide peak around the mean.
winRed Residual Sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
summary(winRed$residual.sugar)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
paste("Peak is at", round(getPeak(winRed$residual.sugar),4))
## [1] "Peak is at 2.0393"
paste("Standard deviation (SD) is ", round(sd(winRed$residual.sugar),4))
## [1] "Standard deviation (SD) is 1.4099"
paste("IQR is ", round(IQR(winRed$residual.sugar),4))
## [1] "IQR is 0.7"
Median is close to the peak and 1stQ, the mean is very close to the 3rdQ, and IQR is very small - indicating a very narrow peak. The max value is extremely far from 3rdQ compared to min and 1stQ values, so graph will be heavily skewed to the right. The SD is a little high, but this is probably due to being affected by the maximum value which seems to be an outlier based on 3rdQ value.
winRed Chlorides: the amount of salt in the wine
summary(winRed$chlorides)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
paste("Peak is at", round(getPeak(winRed$chlorides),4))
## [1] "Peak is at 0.0777"
paste("Standard deviation (SD) is ", round(sd(winRed$chlorides),4))
## [1] "Standard deviation (SD) is 0.0471"
paste("IQR is ", round(IQR(winRed$chlorides),4))
## [1] "IQR is 0.02"
The median is very close to the mean, peak, 1stQ and 3rdQ values, meaning this could be a skewed distribution. Looking at the difference between the 3rdQ and max values, the data is skewed to the right. This is further supported by how much closer the min and 1stQ values are in comparison. The SD and IQR are very low, so the graph is going to have a very narrow peak.
winRed Free Sulfur Dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
summary(winRed$free.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
paste("Peak is at", round(getPeak(winRed$free.sulfur.dioxide),4))
## [1] "Peak is at 6.5351"
paste("Standard deviation (SD) is ", round(sd(winRed$free.sulfur.dioxide),4))
## [1] "Standard deviation (SD) is 10.4602"
paste("IQR is ", round(IQR(winRed$free.sulfur.dioxide),4))
## [1] "IQR is 14"
The mean and median are close together, however the peak is very far from both and closer to the 1stQ. This could be a wide peak with what I imagine is a steep increase up to the peak, then a slow decrease towards the mean, and then a quick tapering off towards the max value, which is probably an outlier. This graph is skewed to the right, as can be seen by the “difference between 3rdQ and max” vs. “difference between min and 1stQ”.
winRed Total Sulfur Dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
summary(winRed$total.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
paste("Peak is at", round(getPeak(winRed$total.sulfur.dioxide),4))
## [1] "Peak is at 21.3705"
paste("Standard deviation (SD) is ", round(sd(winRed$total.sulfur.dioxide),4))
## [1] "Standard deviation (SD) is 32.8953"
paste("IQR is ", round(IQR(winRed$total.sulfur.dioxide),4))
## [1] "IQR is 40"
The mean and median are quite far apart, so these values are quite skewed. The peak is also quite far away from both and closer to the 1stQ. Since the difference between the 1stQ and min values is so much smaller than the difference between the 3rdQ and max values, the distribution is heavily right-skewed with a sharp increase to the peak. The IQR is quite small when looking at the range of values, so I would expect the slope in this part of the graph to be gentle.
winRed Density: the density of wine is close to that of water depending on the percent alcohol and sugar content
summary(winRed$density)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
paste("Peak is at", round(getPeak(winRed$density),4))
## [1] "Peak is at 0.9968"
paste("Standard deviation (SD) is ", round(sd(winRed$density),4))
## [1] "Standard deviation (SD) is 0.0019"
paste("IQR is ", round(IQR(winRed$density),4))
## [1] "IQR is 0.0022"
The mean, median, and peak are all extremely close, and the range (range = 0.0136) of this graph is not very large, so I expect that this graph will look very close to a normal distribution around the median. The SD and IQR are also extremely small, so the peak of the graph will be quite narrow.
winRed pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
summary(winRed$pH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
paste("Peak is at", round(getPeak(winRed$pH),4))
## [1] "Peak is at 3.3142"
paste("Standard deviation (SD) is ", round(sd(winRed$pH),4))
## [1] "Standard deviation (SD) is 0.1544"
paste("IQR is ", round(IQR(winRed$pH),4))
## [1] "IQR is 0.19"
Mean, median, and peak are again quite close together, the range is not very large (range = 1.27) and SD and IQR are very small, so again I expect this graph to have a narrow peak concentrated around the median. Also it would seem that our pH data behaves within the expected range of wine pH (between 3-4).
winRed Sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
summary(winRed$sulphates)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
paste("Peak is at", round(getPeak(winRed$sulphates),4))
## [1] "Peak is at 0.5776"
paste("Standard deviation (SD) is ", round(sd(winRed$sulphates),4))
## [1] "Standard deviation (SD) is 0.1695"
paste("IQR is ", round(IQR(winRed$sulphates),4))
## [1] "IQR is 0.18"
The mean and median are quite close together, however the peak is closer to the 1stQ, indicating a right skew. The difference between 3rdQ and max value is quite large further supporting a heavy right skew. The SD and IQR are also quite small, indicating that the peak is quite narrow.
winRed Alcohol: the percent alcohol content of the wine
summary(winRed$alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
paste("Peak is at", round(getPeak(winRed$alcohol),4))
## [1] "Peak is at 9.501"
paste("Standard deviation (SD) is ", round(sd(winRed$alcohol),4))
## [1] "Standard deviation (SD) is 1.0657"
paste("IQR is ", round(IQR(winRed$alcohol),4))
## [1] "IQR is 1.6"
The mean and median are very close, however the peak and 1stQ are almost the same value, indicating a steep increase to the peak, with a right skew. Since the mean and 3rdQ are quite close, this further supports a right skew. The IQR and SD are very small, however the peak is not inside the IQR (or rather it is at the very beginning), so the graph is very steep at the beginning, climbing to the peak, then descending a little slower.
winRed Quality: Output variable based on sensory data. Score between 0 and 10.
summary(winRed$quality)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
paste("Peak is at", round(getPeak(winRed$quality),4))
## [1] "Peak is at 4.9959"
From the summaries I have observed the following about winWht:
summary(winWht$fixed.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
paste("Peak is at", round(getPeak(winWht$fixed.acidity),4))
## [1] "Peak is at 6.7443"
paste("Standard deviation (SD) is ", round(sd(winWht$fixed.acidity),4))
## [1] "Standard deviation (SD) is 0.8439"
paste("IQR is ", round(IQR(winWht$fixed.acidity),4))
## [1] "IQR is 1"
The mean, median and peak are almost the same value, with a very small IQR and SD, indicating a very narrow peak. The difference between the 3rdQ and max values indicates a right skew of the graph. This is further supported by the smaller difference between the min and 1stQ values.
winWht Volatile Acidity: The amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
summary(winWht$volatile.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
paste("Peak is at", round(getPeak(winWht$volatile.acidity),4))
## [1] "Peak is at 0.2506"
paste("Standard deviation (SD) is ", round(sd(winWht$volatile.acidity),4))
## [1] "Standard deviation (SD) is 0.1008"
paste("IQR is ", round(IQR(winWht$volatile.acidity),4))
## [1] "IQR is 0.11"
Mean, median and peak are all quite close together, with a very small IQR and SD, indicating a narrow peak, however the difference between 3rdQ and max value is quite high compared to the difference between min and 1stQ, indicating a right skew.
winWht Citric Acid: Found in small quantities, citric acid can add ‘freshness’ and flavor to wines
summary(winWht$citric.acid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
paste("Peak is at", round(getPeak(winWht$citric.acid),4))
## [1] "Peak is at 0.2945"
paste("Standard deviation (SD) is ", round(sd(winWht$citric.acid),4))
## [1] "Standard deviation (SD) is 0.121"
paste("IQR is ", round(IQR(winWht$citric.acid),4))
## [1] "IQR is 0.12"
Most of the values are close to each other and close to the mean value. With a small IQR and SD, I would expect a narrow peak. The difference between the 3rdQ and max values indicates a right skew.
winWht Residual Sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
summary(winWht$residual.sugar)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
paste("Peak is at", round(getPeak(winWht$residual.sugar),4))
## [1] "Peak is at 1.5313"
paste("Standard deviation (SD) is ", round(sd(winWht$residual.sugar),4))
## [1] "Standard deviation (SD) is 5.0721"
paste("IQR is ", round(IQR(winWht$residual.sugar),4))
## [1] "IQR is 8.2"
The median (5.2) and mean (6.39) are fairly close, however the peak (1.53) is far below both and sits near the 1stQ. The difference between the 3rdQ (9.9) and max value (65.8) is very high, indicating a heavy right skew. The SD (5.07) and IQR (8.2) are large relative to the median, so the values are widely spread. The max value is 65.8, which is above 45 grams/liter, so there is at least one sweet wine in this dataset.
winWht Chlorides: the amount of salt in the wine
summary(winWht$chlorides)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
paste("Peak is at", round(getPeak(winWht$chlorides),4))
## [1] "Peak is at 0.0453"
paste("Standard deviation (SD) is ", round(sd(winWht$chlorides),4))
## [1] "Standard deviation (SD) is 0.0218"
paste("IQR is ", round(IQR(winWht$chlorides),4))
## [1] "IQR is 0.014"
The median, mean and peak are all very close, and the SD and IQR are extremely small, indicating a very narrow peak. The difference between the 3rdQ and max values indicates a heavy right skew. Most wines seem to have low chloride levels, with some exceptions.
winWht Free Sulfur Dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
summary(winWht$free.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
paste("Peak is at", round(getPeak(winWht$free.sulfur.dioxide),4))
## [1] "Peak is at 30.4645"
paste("Standard deviation (SD) is ", round(sd(winWht$free.sulfur.dioxide),4))
## [1] "Standard deviation (SD) is 17.0071"
paste("IQR is ", round(IQR(winWht$free.sulfur.dioxide),4))
## [1] "IQR is 23"
Mean and median are close together, with the peak a little lower. The SD is quite large, but the IQR is very small, indicating a narrow peak but there are data outliers. The difference between 3rdQ and max indicates a heavy right skew. There is a large range of values here.
winWht Total Sulfur Dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
summary(winWht$total.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
paste("Peak is at", round(getPeak(winWht$total.sulfur.dioxide),4))
## [1] "Peak is at 117.5997"
paste("Standard deviation (SD) is ", round(sd(winWht$total.sulfur.dioxide),4))
## [1] "Standard deviation (SD) is 42.4981"
paste("IQR is ", round(IQR(winWht$total.sulfur.dioxide),4))
## [1] "IQR is 59"
The mean and median are quite close together, however since the median is lower than the mean, this indicates a right skew. This is further supported by the difference between the 3rdQ and max values: the max value is significantly higher. The SD and IQR are quite high, indicating that the peak will be quite wide. Note that the 50 ppm threshold quoted above applies to free SO2, so these total SO2 values do not by themselves show that SO2 is evident in the majority of wines in the winWht dataset.
winWht Density: the density of wine is close to that of water depending on the percent alcohol and sugar content
summary(winWht$density)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
paste("Peak is at", round(getPeak(winWht$density),4))
## [1] "Peak is at 0.9921"
paste("Standard deviation (SD) is ", round(sd(winWht$density),4))
## [1] "Standard deviation (SD) is 0.003"
paste("IQR is ", round(IQR(winWht$density),4))
## [1] "IQR is 0.0044"
The mean, median, and peak are all extremely close, and the SD and IQR are also very small, so this graph is very heavily clustered around the mean. However, the difference in value between 3rdQ and max is 0.0429, which is quite high compared to the differences between the other values. This indicates a heavily right-skewed graph.
winWht pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
summary(winWht$pH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
paste("Peak is at", round(getPeak(winWht$pH),4))
## [1] "Peak is at 3.1547"
paste("Standard deviation (SD) is ", round(sd(winWht$pH),4))
## [1] "Standard deviation (SD) is 0.151"
paste("IQR is ", round(IQR(winWht$pH),4))
## [1] "IQR is 0.19"
Mean, median, and peak are again quite close together, and IQR and SD are very small, so this graph is very heavily concentrated around the mean and peak is narrow.
summary(winWht$sulphates)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
paste("Peak is at", round(getPeak(winWht$sulphates),4))
## [1] "Peak is at 0.4549"
paste("Standard deviation (SD) is ", round(sd(winWht$sulphates),4))
## [1] "Standard deviation (SD) is 0.1141"
paste("IQR is ", round(IQR(winWht$sulphates),4))
## [1] "IQR is 0.14"
The mean and median are quite close together, with the median less than the mean. This indicates a right skew, which is further supported by the difference between the 3rdQ and max values being greater than the difference between the min and 1stQ values. The IQR is quite small, as is the SD, so I would imagine the peak to be narrow.
summary(winWht$alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
paste("Peak is at", round(getPeak(winWht$alcohol),4))
## [1] "Peak is at 9.366"
paste("Standard deviation (SD) is ", round(sd(winWht$alcohol),4))
## [1] "Standard deviation (SD) is 1.2306"
paste("IQR is ", round(IQR(winWht$alcohol),4))
## [1] "IQR is 1.9"
The mean and median are very close, and the peak and 1stQ are very close, indicating this is a slightly flat graph. The SD and IQR are a little high, so it is a wider, flat graph.
winWht Quality: Output variable based on sensory data. Score between 0 and 10.
summary(winWht$quality)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
getPeak(winWht$quality)
## [1] 5.993408
## winRed Variables Count Plot
The above plots show the count values of all the variables. Some of these are a little inaccurate due to a standard binwidth being applied to all of the graphs (eg. citric acid could be further refined), but it gives a good high level description of the distribution. From here I can see that a lot of the data has a right skew, so there are a lot of outliers in our data.
Alcohol seems to have a wide spread, and citric acid has a strange graph. It would warrant further investigation.
Along the diagonal, I can see the density plots of each of the variables. It is clear that most of these variables are right-skewed, with only density and pH being centred. The density plot of citric acid is quite interesting to me as it is an odd shape. I’d be interested to see if and how it affects quality.
From the above correlation matrix, I can see the following correlations with Quality:
Quality has a weak positive correlation with alcohol, sulphates, and citric acid
Quality has a weak negative correlation with volatile acidity
Other variables that have correlations:
It makes sense for free sulfur dioxide to be strongly positively correlated with total sulfur dioxide, since the total amount includes the free amount plus the bound amount. I think the relationship between citric acid and fixed acidity makes sense as well: citric acid is a type of fixed acid, so it would make sense for both of these values to be strongly positively correlated. However, I was initially not confident about the strong correlation between density and fixed acidity. After some research I have discovered that the acids that make up the fixed acidity component (tartaric, malic, and citric) all have densities that are greater than water, so it makes sense that the higher the fixed acidity level, the higher the density.
Similarly, for the values with strong negative correlations they make logical sense. The more acidic a substance becomes, the lower the pH level, so it makes sense for the pH to be strongly negatively correlated with the acid values.
For alcohol and density, this is also logical. Ethanol has a density that is less than water, so the more alcohol the wine has, the lower the density. It is interesting to note that quality and alcohol have a positive correlation (0.476), while quality and density have only a weak negative correlation (-0.175). The sign is what I would expect, since more alcohol means a lower density, but the correlation is much weaker than the one with alcohol, so there must be other factors that are affecting the density. I am not sure I understand the strong negative correlation between citric acid and volatile acidity. Could it be that since citric acid is a type of fixed acid, its presence reduces the volatile acidity flavour?
One point that sticks out to me as being quite odd is that there is a positive relation between pH and volatile acidity. This does not make any sense, since a lower pH indicates a higher acidity. I will look into this deeper later.
From the scatter plots along the lower half of the diagonal, the data clearly shows the strong positive and negative correlations, but we can also see the distribution of outliers. For example, if we removed the outliers from the density-residual sugar relationship, I would imagine that this relationship would be much stronger.
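The idea of removing outliers and re-checking a relationship could be tested along these lines. This is a minimal sketch that uses Tukey's 1.5 * IQR rule as the outlier definition, which is my assumption; the helper names `isOutlier` and `corWithoutOutliers` and the toy data are hypothetical, and the column names are assumed to match the wine data.

```r
# Flag values outside the 1.5 * IQR fences (Tukey's rule).
isOutlier <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Pearson correlation on all rows, and on the rows where neither
# variable is flagged as an outlier.
corWithoutOutliers <- function(df, xName, yName) {
  keep <- !isOutlier(df[[xName]]) & !isOutlier(df[[yName]])
  c(all     = cor(df[[xName]], df[[yName]]),
    trimmed = cor(df[[xName]][keep], df[[yName]][keep]),
    dropped = sum(!keep))
}

# Hypothetical toy data (NOT the wine data): a linear trend plus a few
# extreme residual sugar values that do not follow it.
set.seed(42)
toy <- data.frame(residual.sugar = rnorm(300, 2.5, 0.5))
toy$density <- 0.996 + 0.001 * toy$residual.sugar + rnorm(300, 0, 0.0005)
toy$residual.sugar[1:5] <- c(12, 13, 14, 15, 15.5)

res <- corWithoutOutliers(toy, "residual.sugar", "density")
print(res)
```

On the real data the call would be `corWithoutOutliers(winRed, "residual.sugar", "density")`. Whether the trimmed correlation is actually stronger there is something the output would have to show.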
First I will focus on diving deeper into the variables that affect quality.
Looking deeper at the quality variable, I will graph it against all other variables.

We can see that the top 3 variables positively affecting quality are:
Alcohol
Sulphates
Citric Acid
It appears that residual sugar has almost no effect on quality.
The top 3 variables negatively affecting quality are:
Volatile Acidity
Total Sulfur Dioxide
Density
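The two rankings above can also be read off numerically by sorting each variable's correlation with quality. A minimal sketch, assuming every predictor is a numeric column and the outcome column is called `quality`; the helper name `rankByCorrelation` and the toy data are hypothetical. Note that this ranks the strength of linear association only, so it does not by itself show that a variable causes a change in quality.

```r
# Correlation of every numeric column with the outcome column,
# sorted from most positive to most negative.
rankByCorrelation <- function(df, outcome = "quality") {
  predictors <- setdiff(names(df)[sapply(df, is.numeric)], outcome)
  cors <- sapply(predictors, function(v) cor(df[[v]], df[[outcome]]))
  sort(cors, decreasing = TRUE)
}

# Hypothetical toy data (NOT the wine data): quality rises with alcohol,
# falls with volatile acidity, and ignores residual sugar.
set.seed(7)
n <- 500
toy <- data.frame(
  alcohol          = rnorm(n, 10.5, 1),
  volatile.acidity = rnorm(n, 0.5, 0.15),
  residual.sugar   = rnorm(n, 2.5, 1)
)
toy$quality <- 5.5 + 0.5 * (toy$alcohol - 10.5) -
  2 * (toy$volatile.acidity - 0.5) + rnorm(n, 0, 0.7)

ranked <- rankByCorrelation(toy)
print(round(ranked, 3))
```

On the real data, `head(rankByCorrelation(winRed), 3)` and `tail(rankByCorrelation(winRed), 3)` would give the top positive and top negative variables respectively.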
I would like to take a look at the pH vs. volatile acidity now. From the correlation matrix, this has a positive correlation which does not make sense - a higher pH indicates a more basic solution. Adding acid should reduce the pH.
It can be seen that most of the points on the graph behave slightly differently to what we would expect a graph between acidity and pH to look like. I would expect that the more the acidity level increases, the lower the pH. However we can see that there are a number of points here where even though the acidity level is the same (or very similar), they have vastly different pH levels. This indicates to me that some of the other variables may be affecting this graph.
We can see that volatile acidity is strongly negatively correlated with fixed acidity, citric acid, sulphates, and alcohol. It is also strongly negatively correlated with quality, but volatile acidity is an input to quality and not the other way around, so quality cannot affect volatile acidity. It could be that one of the other variables has affected the volatile acidity points, resulting in an odd relationship with pH.
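One way to check whether another variable is behind the odd pH relationship is to compare the slope of volatile acidity in a simple regression with its slope once a second variable is included. The sketch below uses made-up data in which fixed acidity drives both pH and volatile acidity. That mechanism is purely an assumption for illustration, not a finding from winRed, and the column names are assumed to match the wine data.

```r
# Hypothetical toy data (NOT the wine data). Fixed acidity lowers pH and is
# negatively related to volatile acidity; volatile acidity itself also
# lowers pH slightly.
set.seed(3)
n <- 1000
fixed    <- rnorm(n, 8, 1.5)
volatile <- 0.9 - 0.05 * fixed + rnorm(n, 0, 0.1)
pH       <- 4.2 - 0.1 * fixed - 0.4 * volatile + rnorm(n, 0, 0.1)
toy <- data.frame(fixed.acidity = fixed, volatile.acidity = volatile, pH = pH)

# Slope of volatile acidity on its own, and after adjusting for fixed acidity.
simple   <- lm(pH ~ volatile.acidity, data = toy)
adjusted <- lm(pH ~ volatile.acidity + fixed.acidity, data = toy)

slopes <- c(simple   = coef(simple)[["volatile.acidity"]],
            adjusted = coef(adjusted)[["volatile.acidity"]])
print(round(slopes, 3))
```

In this toy example the simple slope is positive while the adjusted slope is negative, because fixed acidity is hiding the direct effect. Fitting the same two models on winRed would show whether something similar is happening in the real data.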
Similar to the winRed data, many of the variables have a strong right skew in winWht. Alcohol has a very wide spread, as do pH and sulphates. The strong right skew indicates that there are a number of outliers in the data. The graphs of chlorides, citric acid, density and free sulfur dioxide all have narrow peaks, indicating that there is not much spread in these variables, so it will be interesting to see if and how they affect quality.
The quality graph indicates that most of these wines have an average rating.
From the density plots along the diagonal, we can see that a lot of the data has a right skew, except for pH, which is to be expected since wine pH should be mostly acidic. Alcohol has an interesting graph, as do residual sugar and density, each of which has a number of peaks. Citric acid also has two peaks.
From the correlation heatmap along the top of the matrix, I can see a number of strong correlations. Firstly in regards to quality:
Quality has a weak positive correlation with alcohol
Quality has a weak negative correlation with density
Some other interesting correlations between variables:
The strong positive correlation between density and residual sugar makes logical sense, since the more sugar the wine has, the more dense the liquid will be. The relationship between total sulfur dioxide and free sulfur dioxide is expected and was also seen in the winRed dataset.
The relationship between total sulfur dioxide (TSO2) and density is interesting and I am not sure I completely understand. From my research, TSO2 exists as a gas in the wine to prevent it from developing bacteria, so I don’t understand how adding a gas can increase density. Scientifically this should spread the liquid particles further apart and make the liquid less dense. I will look into this further.
The strong negative correlations in relation to alcohol are all logical and can be expected. Since alcohol is less dense than water, it makes sense that as alcohol increases, density decreases.
I have done some research about the relationship between alcohol and TSO2, and it turns out that “Ethanol acts synergistically and enhances the bacteria-killing effect of molecular SO2, so high-alcohol wines require less SO2 protection” (Source: https://www.extension.purdue.edu/extmedia/fs/fs-52-w.pdf), so again it makes sense that as the level of alcohol increases, the level of SO2 decreases.
The negative correlation between alcohol and residual sugar also makes sense. Residual sugar is the sugar that remains after the alcoholic fermentation process finishes, so the less sugar that remains in the wine, the more alcohol was produced during fermentation.
The relationship between pH and fixed acidity is logical - as the acid levels increase, pH is expected to decrease.
I will look further at the quality data to determine if we can get any further information on what variables affect it:
We can see that the top 3 variables positively affecting quality are:
Alcohol
pH
Sulphates (but very minimally)
It appears that most of the variables negatively affect the quality, with the exception of alcohol
The top 3 variables negatively affecting quality are:
Density
Chlorides
Volatile Acidity
I will now take a closer look at TSO2 and Density - as far as I understand, I believe these should have a negative correlation.

### Total Sulfur Dioxide vs. Density
From the above graph, I can see that there are a few outliers that could be pulling the fitted line towards a positive slope. Based on the rest of the data, without these outliers, I would imagine the linear model being very slightly negative, which would support the idea that an increase in TSO2 leads to a lower density. I would need to refit the model without the outliers to confirm this.
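A refit without the most influential points could look like the sketch below. It uses Cook's distance with the common 4 / n rule of thumb to decide which points to drop, which is my choice of method rather than anything prescribed by the unit. The helper name `slopeWithoutInfluential` and the toy data are hypothetical, and the column names are assumed to match the wine data.

```r
# Fit y ~ x, drop points whose Cook's distance exceeds 4 / n, and refit.
# Returns both slopes and the number of points dropped.
slopeWithoutInfluential <- function(df, xName, yName) {
  form    <- reformulate(xName, response = yName)
  full    <- lm(form, data = df)
  keep    <- cooks.distance(full) <= 4 / nrow(df)
  trimmed <- lm(form, data = df[keep, , drop = FALSE])
  c(full    = coef(full)[[2]],
    trimmed = coef(trimmed)[[2]],
    dropped = sum(!keep))
}

# Hypothetical toy data (NOT the wine data): a slightly negative trend plus
# three extreme points with high TSO2 and high density.
set.seed(11)
n <- 300
tso2 <- runif(n, 50, 200)
dens <- 0.995 - 0.00002 * tso2 + rnorm(n, 0, 0.001)
toy <- data.frame(
  total.sulfur.dioxide = c(tso2, 400, 400, 400),
  density              = c(dens, 1.015, 1.015, 1.015)
)

res <- slopeWithoutInfluential(toy, "total.sulfur.dioxide", "density")
print(signif(res, 3))
```

On the real data the call would be `slopeWithoutInfluential(winWht, "total.sulfur.dioxide", "density")`. If the trimmed slope stays positive there, the outliers are not the explanation and the positive correlation would need another one.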
## SUMMARY

From the two wine datasets, I have found that the variable with the greatest effect on quality was alcohol. This held true for both the winRed and winWht data. It was interesting to see the difference in how the variables affected the two different wine types.
Where some variables affected winRed quality positively (e.g. fixed acidity), they had little or negative effect on winWht quality. The only variable that was shown to affect quality strongly and positively in both data sets was alcohol. The variables that negatively affected quality were, however, consistent across both: volatile acidity, density, and chlorides. This is understandable since the higher the density, the less alcohol that is present, leading to a lower quality score. The presence of volatile acidity and chlorides has been shown to have a negative impact. Again this is logical, since these components add a vinegar-y taste and smell to the wine.
I am curious about how much to rely on the findings from the winRed data set, since it has significantly fewer data points than winWht. If I had a similarly sized data set, then maybe I would be able to draw more of a comparison between red and white wine. However, I am reluctant to do so with the current data set as I don’t think it would provide a good sample.